The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation

Goyal, Naman; Gao, Cynthia; Chaudhary, Vishrav; Chen, Peng-Jen; Wenzek, Guillaume; Ju, Da; Krishnan, Sanjana; Ranzato, Marc'Aurelio; Guzman, Francisco; Fan, Angela

arXiv:2106.03193 (cs)

[Submitted on 6 Jun 2021]

Abstract: One of the biggest challenges hindering progress in low-resource and multilingual machine translation is the lack of good evaluation benchmarks. Current evaluation benchmarks either lack good coverage of low-resource languages, consider only restricted domains, or are low quality because they are constructed using semi-automatic procedures. In this work, we introduce the FLORES-101 evaluation benchmark, consisting of 3001 sentences extracted from English Wikipedia and covering a variety of different topics and domains. These sentences have been translated into 101 languages by professional translators through a carefully controlled process. The resulting dataset enables better assessment of model quality on the long tail of low-resource languages, including the evaluation of many-to-many multilingual translation systems, as all translations are multilingually aligned. By publicly releasing such a high-quality and high-coverage dataset, we hope to foster progress in the machine translation community and beyond.
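
To illustrate why multilingual alignment matters for many-to-many evaluation, here is a minimal Python sketch. It assumes a corpus stored as a mapping from language code to a list of sentences in the same order for every language; the toy corpus, the language codes, and the helper names `translation_directions` and `aligned_pairs` are hypothetical and are not taken from the FLORES-101 release. Because sentence i means the same thing in every language, any ordered pair of languages yields a source/reference test set without extra annotation.

```python
from itertools import permutations


def translation_directions(languages):
    """Return every ordered (source, target) pair of distinct languages."""
    return list(permutations(languages, 2))


def aligned_pairs(corpus, src, tgt):
    """Pair sentence i of `src` with sentence i of `tgt`.

    Assumes `corpus` maps a language code to a list of sentences and that
    all lists are in the same order (the multilingual alignment property).
    """
    src_sents, tgt_sents = corpus[src], corpus[tgt]
    if len(src_sents) != len(tgt_sents):
        raise ValueError("aligned corpora must have the same number of sentences")
    return list(zip(src_sents, tgt_sents))


# Hypothetical toy corpus; codes and sentences are illustrative only.
toy_corpus = {
    "eng": ["Hello.", "Thank you."],
    "fra": ["Bonjour.", "Merci."],
    "spa": ["Hola.", "Gracias."],
}

directions = translation_directions(sorted(toy_corpus))
print(len(directions))                          # 3 languages -> 3 * 2 = 6 directions
print(aligned_pairs(toy_corpus, "fra", "spa"))  # a non-English direction

# With 101 aligned languages the same counting gives 101 * 100 directions.
print(len(translation_directions(range(101))))
```

Under this assumption, a direction such as French to Spanish is evaluated directly against professional translations of the same sentences, with no pivoting through English.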

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2106.03193 [cs.CL]
	(or arXiv:2106.03193v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2106.03193

Submission history

From: Angela Fan
[v1] Sun, 6 Jun 2021 17:58:12 UTC (1,898 KB)
